
Running replicates statistics on bins and bed files #80

Merged · 6 commits · Oct 7, 2020
Conversation

@cnluzon (Collaborator) commented Oct 7, 2020

Included a bwstats.R file with functionality to compute differential statistics based on the DESeq2 package, for both bins and BED files.

Added functions: bw_bed_diff_analysis and bw_bins_diff_analysis. These operate on two lists of bigWig files plus labels. Labels are mandatory at this point because I believe this makes the analysis more robust to mistakes like forgetting what was compared against what, so you need to provide at least "treated" / "untreated", or some other labels that identify the two groups of bigWig files.

This is a work in progress: I still need to add automated testing. Even though I am not including tests of the actual values at this point (relying on DESeq2 for those), I am going to include at least tests that check that it runs and that the parameters passed end up where they are supposed to.

Since we mostly operate on scaled bigWig files, the estimateSizeFactors step usually performed by DESeq2 is skipped by default, but an estimate_size_factors parameter is provided:

bw_bins_diff_analysis(bwlist_1, bwlist_2, "treated", "untreated", estimate_size_factors = TRUE)

will run the standard DESeq2 size-factor estimation. I provide this for cases where we are not looking at scaled bigWig files; the default value of this parameter is FALSE.

The return value of these functions is a results table like the one obtained from DESeq2::results, which means that for any analysis you still need to set cutoff thresholds for p-value and/or fold change.
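For the downstream thresholding step, a minimal sketch could look like the following. It assumes the returned table carries the usual DESeq2::results columns (padj, log2FoldChange), and bwlist_1 / bwlist_2 are hypothetical vectors of bigWig file paths:

```r
# Sketch only: bwlist_1 / bwlist_2 stand in for the two groups of bigWig files.
res <- bw_bins_diff_analysis(bwlist_1, bwlist_2, "treated", "untreated")

# The result behaves like a DESeq2::results table, so cutoffs are applied
# downstream, e.g. adjusted p-value < 0.05 and |log2 fold change| > 1:
significant <- subset(as.data.frame(res),
                      padj < 0.05 & abs(log2FoldChange) > 1)
```

The cutoff values here are illustrative; the point is that the function itself does not filter, leaving the thresholds to the analysis.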

@shaorray (Contributor) commented Oct 7, 2020

One detail about DESeq's testing: the input counts need to be integers larger than 1, and preferably above 100.

Since it transforms read counts (or bin counts) with a hyperbolic function, smaller numbers (1, 10) are more sensitive for DESeq than larger ones like 1000.

Because Rsamtools::summary gives read density in the bins, the counts are normally below 1, so multiplying by the bin width, or by any large constant, is a good idea to avoid artefacts.
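The density-to-count scaling suggested above can be sketched in base R with toy numbers (the densities and bin width here are made up for illustration):

```r
# Per-bin coverage densities, as a bigWig summary typically reports them:
# fractional values well below the integer counts DESeq2 expects.
density <- c(0.18, 0.03, 0.42)
bin_width <- 10000  # assumed bin size in bp

# Multiply by the bin width (or any large constant) and round to integers,
# so DESeq2 sees counts well above 1 instead of fractional densities.
counts <- round(density * bin_width)
counts
# [1] 1800  300 4200
```

Any constant large enough to push the values past ~100 achieves the same effect; the bin width is simply a natural choice when the bins are fixed-size.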

@cnluzon (Collaborator, Author) commented Oct 7, 2020

My idea was to simulate the "real" number of reads using the estimated fragment length and bin size, but then I thought a constant value independent of length may be better if the input is a BED file whose features are not bins and whose lengths may vary. Of course, one could argue that large differences in length would affect results in either case.

I will use a larger constant for the count matrix to make the results more robust. Thanks!

@cnluzon cnluzon merged commit 1f4993c into master Oct 7, 2020
@cnluzon cnluzon deleted the deseq-func branch October 7, 2020 16:41